
feat(query-engine): add hash functions fnv, murmur3, md5, sha1, sha512, xxh3, xxh128#2887

Merged
albertlockett merged 12 commits into
open-telemetry:mainfrom
SzymonIwaniuk:additional-hash-algorithms
May 12, 2026

Conversation

@SzymonIwaniuk
Contributor

@SzymonIwaniuk SzymonIwaniuk commented May 7, 2026

Change Summary

Add seven hash functions to the OTAP query-engine, per #2834:

  • md5(): MD5 digest, backed by DataFusion's built-in crypto::md5 UDF.
  • sha1(): SHA-1 digest, implemented as a custom ScalarUDF using the sha1 crate (DataFusion has no SHA-1 equivalent).
  • sha512(): SHA-512 digest, backed by DataFusion's built-in crypto::sha512 UDF.
  • fnv(): FNV-1a 64-bit hash, implemented as a custom ScalarUDF.
  • murmur3(): MurmurHash3 32-bit hash, implemented as a custom ScalarUDF.
  • xxh3(): XXH3 64-bit hash, implemented as a custom ScalarUDF using xxhash-rust.
  • xxh128(): XXH3 128-bit hash, implemented as a custom ScalarUDF using xxhash-rust.
logs | extend attributes["hash"] = encode(md5(attributes["body"]), "hex")
logs | extend attributes["bucket"] = murmur3(attributes["service.name"])
logs | extend attributes["sig"] = xxh3(attributes["message"])
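For readers unfamiliar with FNV-1a, here is a minimal standalone sketch of the 64-bit variant that fnv() is described as computing. It is not the code from pipeline/functions/fnv.rs: the function names are hypothetical, and the cast from u64 to i64 is an assumption about how an unsigned hash could fit the Int64 return type mentioned in this PR.

```rust
/// Standard FNV-1a 64-bit parameters (offset basis and prime).
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// FNV-1a: XOR each byte into the hash, then multiply by the prime.
/// (Hypothetical name; not taken from the PR.)
fn fnv1a_64(data: &[u8]) -> u64 {
    data.iter().fold(FNV_OFFSET_BASIS, |hash, &byte| {
        (hash ^ u64::from(byte)).wrapping_mul(FNV_PRIME)
    })
}

/// Assumption: the unsigned hash is reinterpreted bit-for-bit as a signed
/// value so it fits an Int64 column. The PR may map it differently.
fn fnv1a_64_as_i64(data: &[u8]) -> i64 {
    fnv1a_64(data) as i64
}
```

Under that assumption, hashes with the top bit set appear as negative Int64 values, which is harmless for bucketing but worth knowing when comparing against other FNV implementations.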

Files

  • consts.rs, parser.rs: register all 7 functions as external functions with param_placeholders(1).
  • pipeline/expr.rs: import DataFusion's md5() and sha512() along with the new custom UDFs, and add them to DataFusionFunctionDef::from_func_name with requires_dict_downcast: true.
  • pipeline/functions/fnv.rs (new): custom FnvHashFunc - FNV-1a 64-bit, returns Int64.
  • pipeline/functions/murmur3.rs (new): custom Murmur3HashFunc - MurmurHash3 32-bit, returns Int64.
  • pipeline/functions/sha1.rs (new): custom Sha1Func using the sha1 crate, returns Binary.
  • pipeline/functions/xxh3.rs (new): custom Xxh3Func using xxhash-rust, returns Int64.
  • pipeline/functions/xxh128.rs (new): custom Xxh128Func using xxhash-rust, returns Binary (16 bytes big-endian).
  • pipeline/functions.rs: module declarations and make_udf_function! registrations.
  • crates/query-engine/Cargo.toml: add sha1 and xxhash-rust workspace dependencies.
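The "16 bytes big-endian" layout noted for xxh128() can be illustrated with a small sketch. It assumes the 128-bit digest is available as a u128 and deliberately does not call xxhash-rust. The helper names are hypothetical and the digest value used in the usage note is a placeholder, not a real XXH3 output.

```rust
/// Serialize a 128-bit digest most-significant byte first, so the hex
/// rendering of the bytes reads the same as the number written in hex.
/// (Hypothetical helper; the PR's Xxh128Func may be structured differently.)
fn digest_to_be_bytes(digest: u128) -> [u8; 16] {
    digest.to_be_bytes()
}

/// Lowercase hex rendering, standing in for encode(..., "hex") in a query.
fn to_hex(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for byte in bytes {
        out.push_str(&format!("{:02x}", byte));
    }
    out
}
```

For the placeholder digest 0x0123456789abcdef0011223344556677, to_hex(&digest_to_be_bytes(digest)) gives "0123456789abcdef0011223344556677". A little-endian layout would reverse the byte order in the hex string.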

What issue does this PR close?

How are these changes tested?

Two layers of tests, all in this PR:

  • Unit tests in each custom UDF module (fnv, murmur3, sha1, xxh3, xxh128): verify scalar string input, binary input, and null handling.
  • pipeline::assign end-to-end tests for every function, exercising both OPL and KQL parsers through the full pipeline. Binary-returning functions (md5, sha1, sha512, xxh128) are wrapped with encode(..., "hex") and the output hex string is asserted. Integer-returning functions (fnv, murmur3, xxh3) assert the Int64 value directly.

cargo xtask quick-check passes clean.
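The three unit-test cases listed above (string input, binary input, null handling) can be sketched against a standalone MurmurHash3 x86 32-bit implementation. This is an illustration, not the PR's Murmur3HashFunc. The explicit seed parameter, the wrapper names, and the widening of the u32 result to i64 are assumptions.

```rust
/// MurmurHash3, x86 32-bit variant. The seed is an explicit parameter here;
/// the PR's murmur3_32 takes only the bytes, so its seed is not shown.
fn murmur3_32(data: &[u8], seed: u32) -> u32 {
    const C1: u32 = 0xcc9e_2d51;
    const C2: u32 = 0x1b87_3593;
    let mut h = seed;

    // Body: mix each full 4-byte little-endian block into the state.
    let mut chunks = data.chunks_exact(4);
    for chunk in &mut chunks {
        let k = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        h ^= k.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2);
        h = h.rotate_left(13).wrapping_mul(5).wrapping_add(0xe654_6b64);
    }

    // Tail: mix the remaining 1 to 3 bytes, if any.
    let tail = chunks.remainder();
    if !tail.is_empty() {
        let mut k: u32 = 0;
        for (i, &byte) in tail.iter().enumerate() {
            k |= u32::from(byte) << (8 * i);
        }
        h ^= k.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2);
    }

    // Finalization: fold in the length and avalanche the bits.
    h ^= data.len() as u32;
    h ^= h >> 16;
    h = h.wrapping_mul(0x85eb_ca6b);
    h ^= h >> 13;
    h = h.wrapping_mul(0xc2b2_ae35);
    h ^= h >> 16;
    h
}

/// Hypothetical wrappers: a null input stays null, and string and binary
/// inputs hash the same underlying bytes. Widening to i64 is an assumption.
fn hash_opt_bytes(value: Option<&[u8]>) -> Option<i64> {
    value.map(|bytes| i64::from(murmur3_32(bytes, 0)))
}

fn hash_opt_str(value: Option<&str>) -> Option<i64> {
    hash_opt_bytes(value.map(|s| s.as_bytes()))
}
```

Routing the string wrapper through the byte wrapper keeps the two input kinds from drifting apart, which is the property the string-versus-binary tests check.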

Are there any user-facing changes?

Yes. Users of the transform processor / query-engine can now call these hash functions in OPL/KQL programs:

logs | extend attributes["id"] = encode(sha1(attributes["body"]), "hex")
logs | extend attributes["bucket"] = fnv(attributes["service.name"])
logs | extend attributes["sig"] = xxh3(attributes["message"])
logs | extend attributes["hash"] = encode(xxh128(attributes["message"]), "hex")

@SzymonIwaniuk SzymonIwaniuk requested a review from a team as a code owner May 7, 2026 08:40
@github-actions github-actions Bot added rust Pull requests that update Rust code query-engine Query Engine / Transform related tasks query-engine-columnar Columnar query engine which uses DataFusion to process OTAP Batches labels May 7, 2026
@SzymonIwaniuk SzymonIwaniuk changed the title Additional hash algorithms feat(query-engine): add hash functions fnv, murmur3, md5, sha1, sha512, xxh3, xxh128 May 7, 2026
Comment thread rust/otap-dataflow/Cargo.toml Outdated
Comment thread rust/otap-dataflow/crates/query-engine/src/pipeline/expr.rs Outdated
@codecov

codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 76.54185% with 213 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.22%. Comparing base (efe20b1) to head (7696dda).
⚠️ Report is 1 commit behind head on main.

❌ Your patch check has failed because the patch coverage (76.54%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2887      +/-   ##
==========================================
- Coverage   86.27%   86.22%   -0.05%     
==========================================
  Files         715      720       +5     
  Lines      272064   272971     +907     
==========================================
+ Hits       234712   235374     +662     
- Misses      36828    37073     +245     
  Partials      524      524              
Components Coverage Δ
otap-dataflow 87.18% <76.54%> (-0.06%) ⬇️
query_abstraction 80.61% <ø> (ø)
query_engine 90.73% <ø> (ø)
otel-arrow-go 52.45% <ø> (ø)
quiver 92.25% <ø> (ø)

Comment thread rust/otap-dataflow/crates/query-engine/src/pipeline/assign.rs
Comment thread rust/otap-dataflow/crates/query-engine/src/pipeline/functions.rs
Comment on lines +130 to +136
ScalarValue::Utf8(v) | ScalarValue::LargeUtf8(v) => {
    Ok(v.as_deref().map(|s| murmur3_32(s.as_bytes())))
}
ScalarValue::Utf8View(v) => Ok(v.as_deref().map(|s| murmur3_32(s.as_bytes()))),
ScalarValue::Binary(v) | ScalarValue::LargeBinary(v) => {
    Ok(v.as_ref().map(|b| murmur3_32(b)))
}
Member

@albertlockett albertlockett May 7, 2026

Why do we support different types when handling a scalar versus handling an Array?

For the array case, it looks like we only support Utf8, LargeUtf8 and Binary. I think these supported types should be consistent, unless there's a good reason for them not to be. Same comment for fnv, sha1 and xxh.

Contributor Author

Good catch. I made the array path consistent with the scalar path by adding Utf8View and LargeBinary support to hash_array in all five UDF files.
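One way to keep the scalar and array paths from diverging again is to route both through a single match over the supported types. The sketch below models that idea with a hypothetical Value enum standing in for the Arrow and DataFusion types. It is not the PR's hash_array, and it uses FNV-1a only to stay self-contained.

```rust
/// Hypothetical stand-in for the value kinds named in the review thread.
/// The real code matches on ScalarValue variants and Arrow array types.
#[derive(Debug, Clone, Copy)]
enum Value<'a> {
    Utf8(Option<&'a str>),
    Utf8View(Option<&'a str>),
    Binary(Option<&'a [u8]>),
    LargeBinary(Option<&'a [u8]>),
}

/// FNV-1a 64-bit, used here only as a small self-contained hash.
fn fnv1a_64(data: &[u8]) -> u64 {
    data.iter().fold(0xcbf2_9ce4_8422_2325_u64, |hash, &byte| {
        (hash ^ u64::from(byte)).wrapping_mul(0x0000_0100_0000_01b3)
    })
}

/// The single place that decides which types are supported.
fn as_bytes<'a>(value: Value<'a>) -> Option<&'a [u8]> {
    match value {
        Value::Utf8(s) | Value::Utf8View(s) => s.map(|s| s.as_bytes()),
        Value::Binary(b) | Value::LargeBinary(b) => b,
    }
}

/// Scalar path: null in, null out.
fn hash_scalar(value: Value<'_>) -> Option<i64> {
    as_bytes(value).map(|bytes| fnv1a_64(bytes) as i64)
}

/// Array path: reuses the scalar path element by element, so it accepts
/// exactly the same set of types.
fn hash_array(values: &[Value<'_>]) -> Vec<Option<i64>> {
    values.iter().map(|value| hash_scalar(*value)).collect()
}
```

A production array kernel would normally iterate typed Arrow arrays rather than per-element enums, so this only illustrates the shared type dispatch, not a performance-oriented layout.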

@SzymonIwaniuk
Contributor Author

Hey, all changes addressed and cargo test -p otap-df-query-engine unit tests pass locally.

test result: ok. 623 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.14s

   Doc-tests otap_df_query_engine

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

@SzymonIwaniuk
Contributor Author

Hey @albertlockett, I fixed the bad merge in assign.rs, could you please re-run CI? I ran the full suite before pushing, so it should finally work.

@albertlockett albertlockett added this pull request to the merge queue May 12, 2026
Merged via the queue into open-telemetry:main with commit ace4972 May 12, 2026
84 of 85 checks passed

Labels

query-engine Query Engine / Transform related tasks query-engine-columnar Columnar query engine which uses DataFusion to process OTAP Batches rust Pull requests that update Rust code

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants